This dataset was collected and downloaded from Gapminder World, which gathers information about how people live their lives in different countries, tracked across the years and over a number of different indicators.
For this study I chose four indicators from the Gapminder datasets: one dependent indicator, total GDP in US dollars, and three independent indicators: employment rate, average schooling years (OWID) and total electricity generation in kilowatt-hours.
I chose GDP as the key dependent indicator for this study because of its importance in measuring countries' economies. It gives essential information on the size of an economy and how it is performing. The general health of an economy is often indicated by its real GDP growth rate; for instance, an increase in real GDP is interpreted as a sign of a well-performing economy (source: IMF).
#importing all necessary packages for this report's analysis, including visualization packages such as matplotlib and plotly
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
In this section of the report, I will load the data, check its cleanliness, then trim, clean and merge the datasets for analysis. Further explanation of the step-by-step data wrangling process is included alongside the code.
First, we load the datasets and dive into the properties of each, reviewing and analyzing them for possible cleaning and modification.
#loading the data using pandas and getting overview about it
df_gdp = pd.read_csv("total_gdp_us_inflation_adjusted.csv", sep = ",")
df_owid = pd.read_csv("owid_education_idx.csv", sep = ",")
df_elect_gen = pd.read_csv("electricity_generation_total.csv", sep = ",")
df_emp = pd.read_csv("aged_15plus_employment_rate_percent.csv", sep = ",")
df_gdp.head(3)
| country | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.590000e+10 | 1.590000e+10 | 1.800000e+10 | 1.900000e+10 | 1.950000e+10 | 1.980000e+10 | 2.020000e+10 | 2.070000e+10 | 2.110000e+10 | 2.170000e+10 |
| 1 | Albania | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.190000e+10 | 1.220000e+10 | 1.240000e+10 | 1.250000e+10 | 1.280000e+10 | 1.300000e+10 | 1.350000e+10 | 1.400000e+10 | 1.450000e+10 | 1.490000e+10 |
| 2 | Algeria | 2.740000e+10 | 2.370000e+10 | 1.900000e+10 | 2.560000e+10 | 2.710000e+10 | 2.870000e+10 | 2.740000e+10 | 3.000000e+10 | 3.320000e+10 | ... | 1.610000e+11 | 1.660000e+11 | 1.720000e+11 | 1.760000e+11 | 1.830000e+11 | 1.900000e+11 | 1.960000e+11 | 1.980000e+11 | 2.010000e+11 | 2.030000e+11 |
3 rows × 61 columns
df_owid.head(3)
| country | 1870 | 1871 | 1872 | 1873 | 1874 | 1875 | 1876 | 1877 | 1878 | ... | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0.000667 | NaN | NaN | NaN | NaN | 0.000667 | NaN | NaN | NaN | ... | 0.200 | 0.207 | 0.213 | 0.220 | 0.227 | 0.233 | 0.233 | 0.240 | 0.240 | 0.253 |
| 1 | Albania | 0.001330 | NaN | NaN | NaN | NaN | 0.001330 | NaN | NaN | NaN | ... | 0.613 | 0.620 | 0.620 | 0.620 | 0.640 | 0.647 | 0.647 | 0.647 | 0.667 | 0.667 |
| 2 | Algeria | 0.005330 | NaN | NaN | NaN | NaN | 0.006000 | NaN | NaN | NaN | ... | 0.447 | 0.460 | 0.473 | 0.493 | 0.507 | 0.520 | 0.527 | 0.527 | 0.533 | 0.533 |
3 rows × 149 columns
Note that the end year in the OWID dataset is 2017; it also has data for years well before the other datasets.
df_elect_gen.head(3)
| country | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Algeria | 1.230000e+10 | 1.300000e+10 | 1.270000e+10 | 1.400000e+10 | 1.530000e+10 | 1.610000e+10 | 1.730000e+10 | 1.830000e+10 | 1.940000e+10 | ... | 4.570000e+10 | 5.310000e+10 | 5.740000e+10 | 5.990000e+10 | 6.420000e+10 | 6.880000e+10 | 7.100000e+10 | 7.600000e+10 | 7.670000e+10 | 8.130000e+10 |
| 1 | Argentina | 4.530000e+10 | 4.900000e+10 | 5.210000e+10 | 5.250000e+10 | 5.090000e+10 | 5.100000e+10 | 5.390000e+10 | 5.610000e+10 | 6.190000e+10 | ... | 1.260000e+11 | 1.290000e+11 | 1.360000e+11 | 1.390000e+11 | 1.390000e+11 | 1.450000e+11 | 1.470000e+11 | 1.460000e+11 | 1.470000e+11 | 1.400000e+11 |
| 2 | Australia | 1.240000e+11 | 1.300000e+11 | 1.360000e+11 | 1.430000e+11 | 1.510000e+11 | 1.560000e+11 | 1.580000e+11 | 1.610000e+11 | 1.650000e+11 | ... | 2.510000e+11 | 2.560000e+11 | 2.510000e+11 | 2.500000e+11 | 2.480000e+11 | 2.550000e+11 | 2.580000e+11 | 2.590000e+11 | 2.630000e+11 | 2.650000e+11 |
3 rows × 36 columns
df_emp.head(3)
| country | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | ... | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0.453 | 0.462 | 0.454 | 0.455 | 0.474 | 0.460 | 0.460 | 0.460 | 0.459 | ... | 0.484 | 0.487 | 0.491 | 0.494 | 0.494 | 0.495 | 0.495 | 0.495 | 0.496 | 0.496 |
| 1 | Albania | 0.508 | 0.506 | 0.487 | 0.484 | 0.481 | 0.482 | 0.489 | 0.488 | 0.483 | ... | 0.453 | 0.449 | 0.468 | 0.477 | 0.483 | 0.474 | 0.473 | 0.472 | 0.472 | 0.471 |
| 2 | Algeria | 0.344 | 0.337 | 0.331 | 0.326 | 0.307 | 0.321 | 0.332 | 0.324 | 0.315 | ... | 0.394 | 0.371 | 0.368 | 0.372 | 0.372 | 0.372 | 0.372 | 0.371 | 0.369 | 0.367 |
3 rows × 33 columns
Note that the start year in the employment rate dataset is 1991, and this dataset includes only 33 columns.
#short descriptive analysis of the gdp data:
df_gdp.describe()
| 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8.700000e+01 | 8.900000e+01 | 8.900000e+01 | 8.900000e+01 | 8.900000e+01 | 9.300000e+01 | 9.600000e+01 | 9.700000e+01 | 9.900000e+01 | 9.900000e+01 | ... | 1.910000e+02 | 1.890000e+02 | 1.880000e+02 | 1.880000e+02 | 1.880000e+02 | 1.870000e+02 | 1.860000e+02 | 1.860000e+02 | 1.850000e+02 | 1.740000e+02 |
| mean | 1.010542e+11 | 1.032279e+11 | 1.088010e+11 | 1.146303e+11 | 1.222843e+11 | 1.237549e+11 | 1.268994e+11 | 1.312804e+11 | 1.377284e+11 | 1.460788e+11 | ... | 3.412494e+11 | 3.554197e+11 | 3.665958e+11 | 3.764052e+11 | 3.867375e+11 | 3.978542e+11 | 4.106097e+11 | 4.237589e+11 | 4.364900e+11 | 4.741701e+11 |
| std | 3.632180e+11 | 3.710472e+11 | 3.928904e+11 | 4.122954e+11 | 4.369644e+11 | 4.547580e+11 | 4.771313e+11 | 4.911570e+11 | 5.137150e+11 | 5.354846e+11 | ... | 1.309248e+12 | 1.346066e+12 | 1.389708e+12 | 1.427030e+12 | 1.465003e+12 | 1.518994e+12 | 1.562798e+12 | 1.609197e+12 | 1.675192e+12 | 1.772224e+12 |
| min | 9.940000e+07 | 1.040000e+08 | 1.090000e+08 | 1.150000e+08 | 1.200000e+08 | 1.260000e+08 | 1.320000e+08 | 1.360000e+08 | 1.450000e+08 | 1.490000e+08 | ... | 3.180000e+07 | 3.420000e+07 | 3.290000e+07 | 3.440000e+07 | 3.490000e+07 | 3.810000e+07 | 3.920000e+07 | 4.080000e+07 | 4.180000e+07 | 4.590000e+07 |
| 25% | 2.350000e+09 | 2.510000e+09 | 2.700000e+09 | 2.840000e+09 | 2.990000e+09 | 3.210000e+09 | 3.390000e+09 | 3.530000e+09 | 3.730000e+09 | 3.955000e+09 | ... | 6.385000e+09 | 7.240000e+09 | 7.415000e+09 | 7.932500e+09 | 8.267500e+09 | 8.435000e+09 | 8.867500e+09 | 9.267500e+09 | 9.770000e+09 | 1.172500e+10 |
| 50% | 8.500000e+09 | 7.250000e+09 | 7.550000e+09 | 7.880000e+09 | 8.060000e+09 | 7.710000e+09 | 7.995000e+09 | 8.390000e+09 | 8.800000e+09 | 9.190000e+09 | ... | 2.490000e+10 | 2.590000e+10 | 2.740000e+10 | 2.885000e+10 | 2.990000e+10 | 3.080000e+10 | 3.195000e+10 | 3.320000e+10 | 3.370000e+10 | 3.980000e+10 |
| 75% | 6.350000e+10 | 6.410000e+10 | 6.520000e+10 | 6.990000e+10 | 7.340000e+10 | 7.700000e+10 | 7.100000e+10 | 6.960000e+10 | 7.500000e+10 | 8.080000e+10 | ... | 1.545000e+11 | 1.660000e+11 | 1.722500e+11 | 1.800000e+11 | 1.842500e+11 | 1.890000e+11 | 2.005000e+11 | 2.087500e+11 | 2.060000e+11 | 2.485000e+11 |
| max | 3.170000e+12 | 3.250000e+12 | 3.440000e+12 | 3.600000e+12 | 3.800000e+12 | 4.050000e+12 | 4.310000e+12 | 4.420000e+12 | 4.630000e+12 | 4.770000e+12 | ... | 1.500000e+13 | 1.520000e+13 | 1.560000e+13 | 1.590000e+13 | 1.620000e+13 | 1.670000e+13 | 1.700000e+13 | 1.730000e+13 | 1.790000e+13 | 1.830000e+13 |
8 rows × 60 columns
#checking for duplicates
print(df_gdp.duplicated().sum())
print(df_owid.duplicated().sum())
print(df_elect_gen.duplicated().sum())
print(df_emp.duplicated().sum())
0
0
0
0
#getting an overview of the dataframes' shapes in terms of rows and columns
print(df_gdp.shape)
print(df_owid.shape)
print(df_elect_gen.shape)
print(df_emp.shape)
(191, 61)
(187, 149)
(77, 36)
(179, 33)
Data cleaning will be performed step by step to reach the best possible form to analyze and explore the data and answer the underlying questions.
Before checking for null values, and for the sake of a clear comparison between the datasets, I will drop columns so that the datasets match in terms of years. From the dataset shapes we can see that the employment rate dataset has the smallest number of columns, but we also notice that the OWID dataset ends at 2017, earlier than the end year of the other datasets. Hence, as a first step I will drop the year columns before 1991 so that every dataset starts at the same year as the narrowest one, and as a second step I will drop the year columns after 2017.
Step one: dropping columns before 1991
#dropping columns from datasets using their index
df_gdp.drop(df_gdp.columns[1:32], axis=1, inplace = True)
df_owid.drop(df_owid.columns[1:122], axis=1, inplace = True)
df_elect_gen.drop(df_elect_gen.columns[1:7], axis=1, inplace = True)
Step two: dropping columns after 2017
#dropping columns from datasets using their index
df_gdp.drop(df_gdp.columns[28:30], axis=1, inplace = True)
df_elect_gen.drop(df_elect_gen.columns[28:30], axis=1, inplace = True)
df_emp.drop(df_emp.columns[28:33], axis=1, inplace = True)
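As a side note, the positional `columns[1:32]`-style slices above are fragile if a dataset's layout ever changes. A hedged alternative sketch, not used in this report, selecting year columns by label on a toy frame shaped like these datasets:

```python
import pandas as pd

# Toy frame mimicking the wide Gapminder layout: a 'country' column
# followed by one column per year (year labels are strings).
df = pd.DataFrame({
    "country": ["A", "B"],
    "1989": [1.0, 2.0],
    "1990": [1.1, 2.1],
    "1991": [1.2, 2.2],
    "2017": [3.0, 4.0],
    "2018": [3.1, 4.1],
})

# Keep 'country' plus only the year columns inside 1991-2017, selected
# by label rather than by position, so the slice survives layout changes.
years = [c for c in df.columns if c != "country" and 1991 <= int(c) <= 2017]
df_trimmed = df[["country"] + years]
print(df_trimmed.columns.tolist())  # ['country', '1991', '2017']
```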
Now that we have a specific timeline for our analyses, with all datasets matched in terms of columns, the second step is to drop rows containing null values. I chose to do this after dropping the columns so that we do not lose too much country data, since many countries had null values in the early years of most of the datasets. After this step we will look into the best way to join the data together for the purpose of exploring the underlying questions.
#exploring Null values information for all datasets
df_gdp.isnull().sum()
country 0 1991 26 1992 23 1993 21 1994 19 1995 13 1996 13 1997 11 1998 11 1999 10 2000 6 2001 5 2002 4 2003 4 2004 3 2005 3 2006 3 2007 3 2008 2 2009 2 2010 0 2011 2 2012 3 2013 3 2014 3 2015 4 2016 5 2017 5 dtype: int64
df_owid.isnull().sum()
country 0 1991 41 1992 41 1993 41 1994 41 1995 38 1996 38 1997 38 1998 38 1999 36 2000 17 2001 17 2002 16 2003 14 2004 11 2005 5 2006 4 2007 5 2008 4 2009 4 2010 3 2011 2 2012 2 2013 2 2014 2 2015 2 2016 2 2017 2 dtype: int64
df_elect_gen.isnull().sum()
country 0 1991 0 1992 0 1993 0 1994 0 1995 0 1996 0 1997 0 1998 0 1999 0 2000 0 2001 0 2002 0 2003 0 2004 0 2005 0 2006 0 2007 0 2008 0 2009 0 2010 0 2011 0 2012 0 2013 0 2014 0 2015 0 2016 0 2017 0 dtype: int64
df_emp.isnull().sum()
country 0 1991 0 1992 0 1993 0 1994 0 1995 0 1996 0 1997 0 1998 0 1999 0 2000 0 2001 0 2002 0 2003 0 2004 0 2005 0 2006 0 2007 0 2008 0 2009 0 2010 0 2011 0 2012 0 2013 0 2014 0 2015 0 2016 0 2017 0 dtype: int64
Notice that the employment rate and electricity datasets have no null values.
#dropping null values from gdp and owid datasets
df_gdp.dropna(inplace=True)
df_owid.dropna(inplace=True)
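An alternative to dropping rows with missing values, not pursued in this report, would be to interpolate across the year columns so that a country with a short gap is kept. A minimal sketch on a toy frame, assuming roughly linear change between observed years:

```python
import pandas as pd
import numpy as np

# Toy wide frame with a gap in the middle of a country's series.
df = pd.DataFrame({
    "country": ["A"],
    "1991": [10.0],
    "1992": [np.nan],
    "1993": [14.0],
})

# Linearly interpolate across the year columns (axis=1) instead of
# dropping the whole row; this keeps the country at the cost of
# assuming smooth change between the observed years.
filled = df.set_index("country").interpolate(axis=1)
print(filled.loc["A", "1992"])  # 12.0
```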
Although merging the given dataframes is essential for exploring some of this study's questions, I chose to do it at a later stage, as other questions require us to work on the individual datasets. This is also why I chose to drop the null values from each dataset individually instead of dropping them after merging.
In this section I will trim and tune the GDP dataset to create a new dataset that will help us explore the first two questions of this study. To do this I will implement the following steps:
#adding gdp_total_growth column to gdp dataset
df_gdp['gdp_total_growth'] = df_gdp['2017'] - df_gdp['1991']
#creating a new gdp dataset including gdp_total_growth for the purpose of answering the research questions
df_gdp1 = df_gdp.copy()
#finding the max value for gdp_total_growth column
max_growth_gdp = df_gdp['gdp_total_growth'].max()
max_growth_gdp
9295000000000.0
#finding the min value for gdp_total_growth column
min_growth_gdp = df_gdp['gdp_total_growth'].min()
min_growth_gdp
-61000000000.0
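Note that `gdp_total_growth` measures absolute growth, which naturally favours large economies. A relative-growth metric would rank countries quite differently; a sketch on toy data (`gdp_rel_growth` is a hypothetical column name not used elsewhere in this report):

```python
import pandas as pd

# Toy GDP frame with just the two boundary years of the study period.
df = pd.DataFrame({
    "country": ["Big", "Small"],
    "1991": [1.0e12, 1.0e9],
    "2017": [1.5e12, 3.0e9],
})

# Relative growth = (end - start) / start: the small economy tripled
# while the big one grew 50%, the opposite of the absolute-growth ranking.
df["gdp_rel_growth"] = (df["2017"] - df["1991"]) / df["1991"]
print(df["gdp_rel_growth"].tolist())  # [0.5, 2.0]
```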
#creating a dataframe containing the max and min countries' rows using boolean indexing
chi_ukr_gdp = df_gdp[(df_gdp['gdp_total_growth'] == min_growth_gdp)|(df_gdp['gdp_total_growth'] == max_growth_gdp)]
chi_ukr_gdp
| country | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | ... | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | gdp_total_growth | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | China | 9.050000e+11 | 1.030000e+12 | 1.180000e+12 | 1.330000e+12 | 1.480000e+12 | 1.620000e+12 | 1.770000e+12 | 1.910000e+12 | 2.060000e+12 | ... | 5.500000e+12 | 6.090000e+12 | 6.670000e+12 | 7.190000e+12 | 7.750000e+12 | 8.330000e+12 | 8.910000e+12 | 9.520000e+12 | 1.020000e+13 | 9.295000e+12 |
| 179 | Ukraine | 1.880000e+11 | 1.700000e+11 | 1.460000e+11 | 1.120000e+11 | 9.880000e+10 | 8.890000e+10 | 8.620000e+10 | 8.460000e+10 | 8.440000e+10 | ... | 1.310000e+11 | 1.360000e+11 | 1.430000e+11 | 1.440000e+11 | 1.440000e+11 | 1.340000e+11 | 1.210000e+11 | 1.240000e+11 | 1.270000e+11 | -6.100000e+10 |
2 rows × 29 columns
Here, I will trim the chi_ukr_gdp dataset to create a new dataset, then manipulate it to help us explore the third question in this study. I will need to convert the rows of the dataset into columns to be able to use them as the basis for my analysis. To do this I will apply the following:
#setting country column to index using set_index method
#dropping gdp_total_growth column
#then switching columns with rows using transpose
chi_ukr_gdp_adj = chi_ukr_gdp.set_index("country")
chi_ukr_gdp_adj.drop("gdp_total_growth", axis=1, inplace = True)
chi_ukr_gdp_adj = chi_ukr_gdp_adj.T
#creating year column using index
chi_ukr_gdp_adj['year'] = chi_ukr_gdp_adj.index
chi_ukr_gdp_adj
| country | China | Ukraine | year |
|---|---|---|---|
| 1991 | 9.050000e+11 | 1.880000e+11 | 1991 |
| 1992 | 1.030000e+12 | 1.700000e+11 | 1992 |
| 1993 | 1.180000e+12 | 1.460000e+11 | 1993 |
| 1994 | 1.330000e+12 | 1.120000e+11 | 1994 |
| 1995 | 1.480000e+12 | 9.880000e+10 | 1995 |
| 1996 | 1.620000e+12 | 8.890000e+10 | 1996 |
| 1997 | 1.770000e+12 | 8.620000e+10 | 1997 |
| 1998 | 1.910000e+12 | 8.460000e+10 | 1998 |
| 1999 | 2.060000e+12 | 8.440000e+10 | 1999 |
| 2000 | 2.230000e+12 | 8.940000e+10 | 2000 |
| 2001 | 2.420000e+12 | 9.760000e+10 | 2001 |
| 2002 | 2.640000e+12 | 1.030000e+11 | 2002 |
| 2003 | 2.900000e+12 | 1.130000e+11 | 2003 |
| 2004 | 3.200000e+12 | 1.260000e+11 | 2004 |
| 2005 | 3.560000e+12 | 1.300000e+11 | 2005 |
| 2006 | 4.020000e+12 | 1.400000e+11 | 2006 |
| 2007 | 4.590000e+12 | 1.500000e+11 | 2007 |
| 2008 | 5.030000e+12 | 1.540000e+11 | 2008 |
| 2009 | 5.500000e+12 | 1.310000e+11 | 2009 |
| 2010 | 6.090000e+12 | 1.360000e+11 | 2010 |
| 2011 | 6.670000e+12 | 1.430000e+11 | 2011 |
| 2012 | 7.190000e+12 | 1.440000e+11 | 2012 |
| 2013 | 7.750000e+12 | 1.440000e+11 | 2013 |
| 2014 | 8.330000e+12 | 1.340000e+11 | 2014 |
| 2015 | 8.910000e+12 | 1.210000e+11 | 2015 |
| 2016 | 9.520000e+12 | 1.240000e+11 | 2016 |
| 2017 | 1.020000e+13 | 1.270000e+11 | 2017 |
In this section I will trim and tune the OWID dataset to create a new dataset that will help us explore more questions of this study. To do this I will implement the following steps:
#adding inc_edu column measuring the change in the education index 1991-2017
df_owid['inc_edu'] = df_owid['2017'] - df_owid['1991']
#creating a dataframe containing the Ukraine and China rows using boolean indexing
chi_ukr_edu = df_owid[(df_owid['country'] == "Ukraine")|(df_owid['country'] == "China")]
#dropping inc_edu from the original owid dataset
df_owid=df_owid.drop("inc_edu", axis=1)
df_owid.head(2)
| country | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | ... | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0.107 | 0.107 | 0.113 | 0.12 | 0.127 | 0.127 | 0.133 | 0.133 | 0.140 | ... | 0.200 | 0.207 | 0.213 | 0.22 | 0.227 | 0.233 | 0.233 | 0.240 | 0.240 | 0.253 |
| 1 | Albania | 0.520 | 0.520 | 0.520 | 0.52 | 0.533 | 0.547 | 0.553 | 0.567 | 0.573 | ... | 0.613 | 0.620 | 0.620 | 0.62 | 0.640 | 0.647 | 0.647 | 0.647 | 0.667 | 0.667 |
2 rows × 28 columns
In this section I will trim and tune the electricity generation dataset to create a new dataset that will help us explore more questions of this study. To do this I will implement the following steps:
#creating a dataframe containing the Ukraine and China rows using boolean indexing
chi_ukr_elect = df_elect_gen[(df_elect_gen['country'] == "Ukraine")|(df_elect_gen['country'] == "China")]
#setting country column to index using set_index method then switching columns with rows using transpose
chi_ukr_elect_adj = chi_ukr_elect.set_index("country")
chi_ukr_elect_adj = chi_ukr_elect_adj.T
#creating year column using index
chi_ukr_elect_adj['year'] = chi_ukr_elect_adj.index
In this section I will trim and tune the employment rate dataset to create a new dataset that will help us explore more questions of this study. To do this I will implement the following steps:
#adding increase_emp column measuring the change in employment rate 1991-2017
df_emp['increase_emp'] = df_emp['2017'] - df_emp['1991']
#creating a dataframe containing the Ukraine and China rows using boolean indexing
chi_ukr_emp = df_emp[(df_emp['country'] == "Ukraine")|(df_emp['country'] == "China")]
#dropping increase_emp from the original emp dataset
df_emp.drop('increase_emp', axis=1, inplace = True)
As a final step of the data wrangling, I will merge all four datasets included in this study into one big dataset. The merge will use an inner join so as to include only the years shared by all datasets. This will be done following these steps:
#reshaping the year columns into rows using the pandas melt method
df_gdp1 = df_gdp.melt(id_vars=["country"],
var_name="Year",
value_name="gdp")
#reshaping the year columns into rows using the pandas melt method
df_owid1 = df_owid.melt(id_vars=["country"],
var_name="Year",
value_name="owid")
#reshaping the year columns into rows using the pandas melt method
df_elect_gen1 = df_elect_gen.melt(id_vars=["country"],
var_name="Year",
value_name="elect_gen")
#reshaping the year columns into rows using the pandas melt method
df_emp1 = df_emp.melt(id_vars=["country"],
var_name="Year",
value_name="emp")
#merging dataframes using the merge method with an inner join
df_combined1 = df_gdp1.merge(df_owid1, on =['country', 'Year'], how='inner')
#merging dataframes using the merge method with an inner join
df_combined2 = df_emp1.merge(df_elect_gen1, on =['country', 'Year'], how='inner')
#merging dataframes using the merge method with an inner join
df_combined_all = df_combined1.merge(df_combined2, on =['country', 'Year'], how='inner')
df_combined_all.head(2)
| country | Year | gdp | owid | emp | elect_gen | |
|---|---|---|---|---|---|---|
| 0 | Algeria | 1991 | 9.090000e+10 | 0.253 | 0.344 | 1.730000e+10 |
| 1 | Argentina | 1991 | 2.220000e+11 | 0.533 | 0.565 | 5.390000e+10 |
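Since an inner merge silently drops any (country, Year) pair missing from either side, it can be worth checking how much data survives. A minimal sketch on toy frames using pandas' `indicator` option (not applied to the real frames above):

```python
import pandas as pd

# Two toy long-format frames that share only one (country, Year) pair.
left = pd.DataFrame({"country": ["A", "B"], "Year": ["1991", "1991"], "gdp": [1.0, 2.0]})
right = pd.DataFrame({"country": ["A", "C"], "Year": ["1991", "1991"], "emp": [0.5, 0.6]})

# An outer merge with indicator=True labels each row 'both', 'left_only'
# or 'right_only'; rows not labelled 'both' would be lost by an inner merge.
check = left.merge(right, on=["country", "Year"], how="outer", indicator=True)
n_both = int((check["_merge"] == "both").sum())
print(n_both)  # 1 shared pair survives an inner merge
```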
chi_ukr_gdp
| country | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | ... | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | gdp_total_growth | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | China | 9.050000e+11 | 1.030000e+12 | 1.180000e+12 | 1.330000e+12 | 1.480000e+12 | 1.620000e+12 | 1.770000e+12 | 1.910000e+12 | 2.060000e+12 | ... | 5.500000e+12 | 6.090000e+12 | 6.670000e+12 | 7.190000e+12 | 7.750000e+12 | 8.330000e+12 | 8.910000e+12 | 9.520000e+12 | 1.020000e+13 | 9.295000e+12 |
| 179 | Ukraine | 1.880000e+11 | 1.700000e+11 | 1.460000e+11 | 1.120000e+11 | 9.880000e+10 | 8.890000e+10 | 8.620000e+10 | 8.460000e+10 | 8.440000e+10 | ... | 1.310000e+11 | 1.360000e+11 | 1.430000e+11 | 1.440000e+11 | 1.440000e+11 | 1.340000e+11 | 1.210000e+11 | 1.240000e+11 | 1.270000e+11 | -6.100000e+10 |
2 rows × 29 columns
fig1 = px.bar(chi_ukr_gdp,
              x="country",
              y="gdp_total_growth",
              title='GDP growth 1991-2017 in USD',
              color="country",
              text="gdp_total_growth"
              )
fig1.show()
Within the final GDP dataset used in this study, China is the country with the highest growth in total GDP, with a total of 9,295,000,000,000 USD, while Ukraine had the largest decline in GDP over the period from 1991 to 2017.
df_ukr_chi_gdp = chi_ukr_gdp_adj[['Ukraine','China','year']]
df_ukr_chi_gdp.plot(figsize=(12,8))
plt.title('Largest vs Lowest GDP Growth Time-series')
plt.ylabel('Total GDP (USD)')
plt.grid(True)
plt.show()
We can see from the graph that while Ukraine's GDP declined steadily over the years, China's GDP has been growing rapidly.
top_largest= df_gdp.nlargest(5, ['gdp_total_growth'])
top_largest.head()
| country | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | ... | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | gdp_total_growth | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35 | China | 9.050000e+11 | 1.030000e+12 | 1.180000e+12 | 1.330000e+12 | 1.480000e+12 | 1.620000e+12 | 1.770000e+12 | 1.910000e+12 | 2.060000e+12 | ... | 5.500000e+12 | 6.090000e+12 | 6.670000e+12 | 7.190000e+12 | 7.750000e+12 | 8.330000e+12 | 8.910000e+12 | 9.520000e+12 | 1.020000e+13 | 9.295000e+12 |
| 182 | United States | 8.990000e+12 | 9.310000e+12 | 9.560000e+12 | 9.950000e+12 | 1.020000e+13 | 1.060000e+13 | 1.110000e+13 | 1.160000e+13 | 1.210000e+13 | ... | 1.460000e+13 | 1.500000e+13 | 1.520000e+13 | 1.560000e+13 | 1.590000e+13 | 1.620000e+13 | 1.670000e+13 | 1.700000e+13 | 1.730000e+13 | 8.310000e+12 |
| 76 | India | 5.130000e+11 | 5.410000e+11 | 5.670000e+11 | 6.040000e+11 | 6.500000e+11 | 6.990000e+11 | 7.280000e+11 | 7.730000e+11 | 8.410000e+11 | ... | 1.540000e+12 | 1.680000e+12 | 1.760000e+12 | 1.860000e+12 | 1.980000e+12 | 2.130000e+12 | 2.290000e+12 | 2.480000e+12 | 2.660000e+12 | 2.147000e+12 |
| 84 | Japan | 4.860000e+12 | 4.910000e+12 | 4.880000e+12 | 4.930000e+12 | 5.060000e+12 | 5.220000e+12 | 5.280000e+12 | 5.220000e+12 | 5.200000e+12 | ... | 5.470000e+12 | 5.700000e+12 | 5.690000e+12 | 5.780000e+12 | 5.890000e+12 | 5.920000e+12 | 5.990000e+12 | 6.020000e+12 | 6.150000e+12 | 1.290000e+12 |
| 181 | United Kingdom | 1.620000e+12 | 1.630000e+12 | 1.670000e+12 | 1.740000e+12 | 1.780000e+12 | 1.820000e+12 | 1.890000e+12 | 1.960000e+12 | 2.030000e+12 | ... | 2.430000e+12 | 2.480000e+12 | 2.510000e+12 | 2.550000e+12 | 2.610000e+12 | 2.670000e+12 | 2.740000e+12 | 2.790000e+12 | 2.840000e+12 | 1.220000e+12 |
5 rows × 29 columns
#building a bar chart of the five countries with the highest total GDP growth
df_top_growth = top_largest[['gdp_total_growth', 'country']].groupby('country').mean()
fig2 = px.bar(df_top_growth, y="gdp_total_growth", title="Top Five Countries with highest total GDP growth 1991-2017")
fig2.show()
The top five countries with the highest GDP growth between 1991 and 2017 are China, the United States, India, Japan and the United Kingdom.
smallest_five= df_gdp.nsmallest(5, ['gdp_total_growth'])
smallest_five
| country | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | ... | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | gdp_total_growth | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 179 | Ukraine | 1.880000e+11 | 1.700000e+11 | 1.460000e+11 | 1.120000e+11 | 9.880000e+10 | 8.890000e+10 | 8.620000e+10 | 8.460000e+10 | 8.440000e+10 | ... | 1.310000e+11 | 1.360000e+11 | 1.430000e+11 | 1.440000e+11 | 1.440000e+11 | 1.340000e+11 | 1.210000e+11 | 1.240000e+11 | 1.270000e+11 | -6.100000e+10 |
| 177 | Tuvalu | 2.220000e+07 | 2.280000e+07 | 2.380000e+07 | 2.620000e+07 | 2.490000e+07 | 2.340000e+07 | 2.580000e+07 | 2.980000e+07 | 2.930000e+07 | ... | 3.270000e+07 | 3.180000e+07 | 3.420000e+07 | 3.290000e+07 | 3.440000e+07 | 3.490000e+07 | 3.810000e+07 | 3.920000e+07 | 4.080000e+07 | 1.860000e+07 |
| 106 | Marshall Islands | 1.260000e+08 | 1.350000e+08 | 1.430000e+08 | 1.520000e+08 | 1.640000e+08 | 1.470000e+08 | 1.380000e+08 | 1.370000e+08 | 1.350000e+08 | ... | 1.510000e+08 | 1.620000e+08 | 1.610000e+08 | 1.570000e+08 | 1.630000e+08 | 1.610000e+08 | 1.640000e+08 | 1.660000e+08 | 1.730000e+08 | 4.700000e+07 |
| 110 | Micronesia, Fed. Sts. | 2.440000e+08 | 2.540000e+08 | 2.750000e+08 | 2.730000e+08 | 2.930000e+08 | 2.830000e+08 | 2.660000e+08 | 2.740000e+08 | 2.770000e+08 | ... | 2.900000e+08 | 2.970000e+08 | 3.060000e+08 | 3.010000e+08 | 2.900000e+08 | 2.830000e+08 | 2.960000e+08 | 2.990000e+08 | 3.070000e+08 | 6.300000e+07 |
| 88 | Kiribati | 1.220000e+08 | 1.230000e+08 | 1.240000e+08 | 1.260000e+08 | 1.260000e+08 | 1.280000e+08 | 1.300000e+08 | 1.390000e+08 | 1.370000e+08 | ... | 1.580000e+08 | 1.560000e+08 | 1.590000e+08 | 1.660000e+08 | 1.730000e+08 | 1.720000e+08 | 1.900000e+08 | 2.000000e+08 | 2.010000e+08 | 7.900000e+07 |
5 rows × 29 columns
The five countries with the lowest GDP growth between 1991 and 2017 are Ukraine, Tuvalu, the Marshall Islands, Micronesia (Fed. Sts.) and Kiribati.
fig4 = px.pie(df_combined_all, values='gdp', names='country', title="Countries' shares of total GDP over all years included in the dataset")
fig4.update_traces(textposition='inside', textinfo='percent+label')
fig4.show()
The pie chart shows that the USA has the highest share of total GDP summed over the dataset years, with a share of 25.3%.
fig4 = px.pie(df_combined_all, values='elect_gen', names='country', title='Countries Total Electricity Generation Shares over total years 1991-2017')
fig4.update_traces(textposition='inside', textinfo='percent+label')
fig4.show()
The pie chart shows that the USA is responsible for the highest share of the world's total electricity generation over the dataset years, with a share of 24.3%.
#building a bar chart of the average employment rate per country
df_avg_emp = df_combined_all[['emp', 'country']].groupby('country').mean()
fig6 = px.bar(df_avg_emp, y="emp", title="Average employment rate per country")
fig6.show()
We can see from the graph that the average employment rate differs from one country to another. The countries with the highest average employment rates are the United Arab Emirates, Vietnam and Iceland.
fig = px.scatter(df_combined_all, x='owid', y='gdp', title= "The Correlation Between Education and Total GDP")
fig.show()
The graph shows, to a certain extent, a weak positive relation; however, it does not establish the existence of a relationship between GDP and the OWID education index.
fig = px.scatter(df_combined_all, x='emp', y='gdp', title= "The Correlation Between Employment Rate and Total GDP")
fig.show()
The graph shows no clear correlation.
fig = px.scatter(df_combined_all, x='elect_gen', y='gdp', title= "The Correlation Between Electrical Energy and Total GDP")
fig.show()
According to the graph there could be a strong positive relation between electrical energy infrastructure and GDP; however, we cannot conclude this from the graph alone.
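The scatter plots give only a visual impression; Pearson correlation coefficients would put numbers behind them. On the real data this would be `df_combined_all[['gdp', 'owid', 'emp', 'elect_gen']].corr()`; a self-contained sketch on toy stand-in data:

```python
import pandas as pd

# Toy stand-in for df_combined_all with the four indicator columns.
df = pd.DataFrame({
    "gdp":       [1.0, 2.0, 3.0, 4.0],
    "owid":      [0.2, 0.3, 0.3, 0.4],
    "emp":       [0.5, 0.4, 0.6, 0.5],
    "elect_gen": [10.0, 20.0, 30.0, 40.0],
})

# Pearson correlation matrix; each entry quantifies the linear relation
# that the corresponding scatter plot only hints at.
corr = df.corr()
print(round(corr.loc["gdp", "elect_gen"], 3))  # 1.0 for this toy data
```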
As a result of my data wrangling, cleaning, analysis and exploration, I can answer the research questions as follows:
Dataset statistics show that China is the country with the highest growth in total GDP, with a total of 9,295,000,000,000 USD, while Ukraine had the largest decline in GDP over the period from 1991 to 2017.
Ukraine's GDP declined steadily over the years while China's GDP grew rapidly.
The top five countries with the highest GDP growth between 1991 and 2017 are China, the USA, India, Japan and the UK. The five countries with the lowest GDP growth over the same period are Ukraine, Tuvalu, the Marshall Islands, Micronesia (Fed. Sts.) and Kiribati.
The USA has the highest share of total GDP summed over the dataset years, with a share of 25.3%.
The USA accounts for the highest share of the world's total electricity generation over the dataset years, with a share of 24.3%.
The average employment rate differs from one country to another, with some countries skewing higher than others. The countries with the highest average employment rates are the United Arab Emirates (75.7%), Vietnam (75.6%) and Iceland (72.9%).
When observing the relation between total GDP and the independent indicators, we find that total GDP and electricity generation seem to be directly proportional; however, we cannot conclude this without further study and analysis of other possible explanations for the apparent relation.
This study has some limitations, among them:
The study is limited to the countries and years included in the final dataset: it covers the period from 1991 to 2017 only and is missing countries such as Afghanistan.
Dropping missing or null values from the datasets might skew this study's analysis and could introduce unintentional bias into the relationships being analyzed.
The study is limited to the given indicators and does not include other important indicators such as the consumer price index.
Suggestions for future research on this topic are as follows:
Including other indicators such as the consumer price index, investment and export volume could help us understand more about the factors affecting GDP.
I also recommend including other education indicators, such as education costs or the number and ratings of educational institutions, to obtain a clearer representation of education for a given country and how it relates to GDP.
A list of reference links used in this study is attached in the zip file as a readme text file.